# Loading Required Libraries
# Reading Data
epa22 <- read.csv("data/2022_EPA.csv")
epa02 <- read.csv("data/2002_EPA.csv")
# Checking Dimensions
dim(epa22)[1] 59918 22
dim(epa02)[1] 15976 22
Primary question to answer- Has the daily concentrations of PM2.5 decreased in California in the 20 years spanning 2002 - 2022?
Why answer this question?
PM2.5 is highly dangerous to health, and a quick glance at the epa.gov website and Google explains that the primary health risks associated with the inhalation of PM2.5 include respiratory issues like asthma exacerbation and bronchitis, cardiovascular problems and even premature death in highly sensitive individuals.
Answering whether PM2.5 has decreased in the state of California over the 2-decade span between 2002 and 2022 will help us analyse if the EPA and US Government’s efforts to improve air quality have been successful, and can provide a valuable starting point for any efforts we take moving forward to reduce air pollution.
EDA Checklist items 1 through 5.
1.1: Read in the data
1.2: Check the size of the data
1.3: Examine the variables and their types
1.4: Look at the top and bottom of the data
1.5: Visualize the distribution of the key variables
The aim is to perform steps 1.1-1.5 for the 2002 and 2022 data from the EPA website.
1.1: Read in the data + 1.2: Check data size [The files for 2002 and 2022 (California, All Sites) were downloaded from the EPA website, and the csv files were renamed to “2022_EPA” and “2002_EPA”, and saved into the .gitignore folder in my repository.]
# Loading Required Libraries
# Reading Data
epa22 <- read.csv("data/2022_EPA.csv")
epa02 <- read.csv("data/2002_EPA.csv")
# Checking Dimensions
dim(epa22)[1] 59918 22
dim(epa02)[1] 15976 22
EPA 2022 data has 59918 rows and 22 columns EPA 2002 data has 15976 rows and 22 columns
1.3: Variables and their types
# Variable Names and Types
print(paste("EPA 2022 Variables:", str(epa22)))'data.frame': 59918 obs. of 22 variables:
$ Date : chr "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site.ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 3 3 3 3 3 3 3 3 3 3 ...
$ Daily.Mean.PM2.5.Concentration: num 12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
$ Units : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily.AQI.Value : int 58 60 39 21 23 21 13 38 59 55 ...
$ Local.Site.Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily.Obs.Count : int 1 1 1 1 1 1 1 1 1 1 ...
$ Percent.Complete : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS.Parameter.Code : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS.Parameter.Description : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method.Code : int 170 170 170 170 170 170 170 170 170 170 ...
$ Method.Description : chr "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
$ CBSA.Code : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA.Name : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State.FIPS.Code : int 6 6 6 6 6 6 6 6 6 6 ...
$ State : chr "California" "California" "California" "California" ...
$ County.FIPS.Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site.Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
$ Site.Longitude : num -122 -122 -122 -122 -122 ...
[1] "EPA 2022 Variables: "
print(paste("EPA 2002 Variables:", str(epa02)))'data.frame': 15976 obs. of 22 variables:
$ Date : chr "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site.ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Daily.Mean.PM2.5.Concentration: num 25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
$ Units : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ Daily.AQI.Value : int 81 93 74 82 98 115 89 62 69 107 ...
$ Local.Site.Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ Daily.Obs.Count : int 1 1 1 1 1 1 1 1 1 1 ...
$ Percent.Complete : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS.Parameter.Code : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS.Parameter.Description : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ Method.Code : int 120 120 120 120 120 120 120 120 120 120 ...
$ Method.Description : chr "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
$ CBSA.Code : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA.Name : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ State.FIPS.Code : int 6 6 6 6 6 6 6 6 6 6 ...
$ State : chr "California" "California" "California" "California" ...
$ County.FIPS.Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ Site.Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
$ Site.Longitude : num -122 -122 -122 -122 -122 ...
[1] "EPA 2002 Variables: "
Both data files (2022 and 2002) have the same 22 variables of the same data type
1.4: Top and bottom of the data
# Headers and Footers
head(epa22) Date Source Site.ID POC Daily.Mean.PM2.5.Concentration Units
1 01/01/2022 AQS 60010007 3 12.7 ug/m3 LC
2 01/02/2022 AQS 60010007 3 13.9 ug/m3 LC
3 01/03/2022 AQS 60010007 3 7.1 ug/m3 LC
4 01/04/2022 AQS 60010007 3 3.7 ug/m3 LC
5 01/05/2022 AQS 60010007 3 4.2 ug/m3 LC
6 01/06/2022 AQS 60010007 3 3.8 ug/m3 LC
Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1 58 Livermore 1 100
2 60 Livermore 1 100
3 39 Livermore 1 100
4 21 Livermore 1 100
5 23 Livermore 1 100
6 21 Livermore 1 100
AQS.Parameter.Code AQS.Parameter.Description Method.Code
1 88101 PM2.5 - Local Conditions 170
2 88101 PM2.5 - Local Conditions 170
3 88101 PM2.5 - Local Conditions 170
4 88101 PM2.5 - Local Conditions 170
5 88101 PM2.5 - Local Conditions 170
6 88101 PM2.5 - Local Conditions 170
Method.Description CBSA.Code
1 Met One BAM-1020 Mass Monitor w/VSCC 41860
2 Met One BAM-1020 Mass Monitor w/VSCC 41860
3 Met One BAM-1020 Mass Monitor w/VSCC 41860
4 Met One BAM-1020 Mass Monitor w/VSCC 41860
5 Met One BAM-1020 Mass Monitor w/VSCC 41860
6 Met One BAM-1020 Mass Monitor w/VSCC 41860
CBSA.Name State.FIPS.Code State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA 6 California 1
2 San Francisco-Oakland-Hayward, CA 6 California 1
3 San Francisco-Oakland-Hayward, CA 6 California 1
4 San Francisco-Oakland-Hayward, CA 6 California 1
5 San Francisco-Oakland-Hayward, CA 6 California 1
6 San Francisco-Oakland-Hayward, CA 6 California 1
County Site.Latitude Site.Longitude
1 Alameda 37.68753 -121.7842
2 Alameda 37.68753 -121.7842
3 Alameda 37.68753 -121.7842
4 Alameda 37.68753 -121.7842
5 Alameda 37.68753 -121.7842
6 Alameda 37.68753 -121.7842
tail(epa22) Date Source Site.ID POC Daily.Mean.PM2.5.Concentration Units
59913 12/01/2022 AQS 61131003 1 3.4 ug/m3 LC
59914 12/07/2022 AQS 61131003 1 3.8 ug/m3 LC
59915 12/13/2022 AQS 61131003 1 6.0 ug/m3 LC
59916 12/19/2022 AQS 61131003 1 34.8 ug/m3 LC
59917 12/25/2022 AQS 61131003 1 23.2 ug/m3 LC
59918 12/31/2022 AQS 61131003 1 1.0 ug/m3 LC
Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
59913 19 Woodland-Gibson Road 1 100
59914 21 Woodland-Gibson Road 1 100
59915 33 Woodland-Gibson Road 1 100
59916 99 Woodland-Gibson Road 1 100
59917 77 Woodland-Gibson Road 1 100
59918 6 Woodland-Gibson Road 1 100
AQS.Parameter.Code AQS.Parameter.Description Method.Code
59913 88101 PM2.5 - Local Conditions 145
59914 88101 PM2.5 - Local Conditions 145
59915 88101 PM2.5 - Local Conditions 145
59916 88101 PM2.5 - Local Conditions 145
59917 88101 PM2.5 - Local Conditions 145
59918 88101 PM2.5 - Local Conditions 145
Method.Description CBSA.Code
59913 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59914 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59915 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59916 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59917 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
59918 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
CBSA.Name State.FIPS.Code State
59913 Sacramento--Roseville--Arden-Arcade, CA 6 California
59914 Sacramento--Roseville--Arden-Arcade, CA 6 California
59915 Sacramento--Roseville--Arden-Arcade, CA 6 California
59916 Sacramento--Roseville--Arden-Arcade, CA 6 California
59917 Sacramento--Roseville--Arden-Arcade, CA 6 California
59918 Sacramento--Roseville--Arden-Arcade, CA 6 California
County.FIPS.Code County Site.Latitude Site.Longitude
59913 113 Yolo 38.66121 -121.7327
59914 113 Yolo 38.66121 -121.7327
59915 113 Yolo 38.66121 -121.7327
59916 113 Yolo 38.66121 -121.7327
59917 113 Yolo 38.66121 -121.7327
59918 113 Yolo 38.66121 -121.7327
head(epa02) Date Source Site.ID POC Daily.Mean.PM2.5.Concentration Units
1 01/05/2002 AQS 60010007 1 25.1 ug/m3 LC
2 01/06/2002 AQS 60010007 1 31.6 ug/m3 LC
3 01/08/2002 AQS 60010007 1 21.4 ug/m3 LC
4 01/11/2002 AQS 60010007 1 25.9 ug/m3 LC
5 01/14/2002 AQS 60010007 1 34.5 ug/m3 LC
6 01/17/2002 AQS 60010007 1 41.0 ug/m3 LC
Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1 81 Livermore 1 100
2 93 Livermore 1 100
3 74 Livermore 1 100
4 82 Livermore 1 100
5 98 Livermore 1 100
6 115 Livermore 1 100
AQS.Parameter.Code AQS.Parameter.Description Method.Code
1 88101 PM2.5 - Local Conditions 120
2 88101 PM2.5 - Local Conditions 120
3 88101 PM2.5 - Local Conditions 120
4 88101 PM2.5 - Local Conditions 120
5 88101 PM2.5 - Local Conditions 120
6 88101 PM2.5 - Local Conditions 120
Method.Description CBSA.Code
1 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
2 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
3 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
4 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
5 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
6 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
CBSA.Name State.FIPS.Code State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA 6 California 1
2 San Francisco-Oakland-Hayward, CA 6 California 1
3 San Francisco-Oakland-Hayward, CA 6 California 1
4 San Francisco-Oakland-Hayward, CA 6 California 1
5 San Francisco-Oakland-Hayward, CA 6 California 1
6 San Francisco-Oakland-Hayward, CA 6 California 1
County Site.Latitude Site.Longitude
1 Alameda 37.68753 -121.7842
2 Alameda 37.68753 -121.7842
3 Alameda 37.68753 -121.7842
4 Alameda 37.68753 -121.7842
5 Alameda 37.68753 -121.7842
6 Alameda 37.68753 -121.7842
tail(epa02) Date Source Site.ID POC Daily.Mean.PM2.5.Concentration Units
15971 12/10/2002 AQS 61131003 1 15 ug/m3 LC
15972 12/13/2002 AQS 61131003 1 15 ug/m3 LC
15973 12/22/2002 AQS 61131003 1 1 ug/m3 LC
15974 12/25/2002 AQS 61131003 1 23 ug/m3 LC
15975 12/28/2002 AQS 61131003 1 5 ug/m3 LC
15976 12/31/2002 AQS 61131003 1 6 ug/m3 LC
Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
15971 62 Woodland-Gibson Road 1 100
15972 62 Woodland-Gibson Road 1 100
15973 6 Woodland-Gibson Road 1 100
15974 77 Woodland-Gibson Road 1 100
15975 28 Woodland-Gibson Road 1 100
15976 33 Woodland-Gibson Road 1 100
AQS.Parameter.Code AQS.Parameter.Description Method.Code
15971 88101 PM2.5 - Local Conditions 117
15972 88101 PM2.5 - Local Conditions 117
15973 88101 PM2.5 - Local Conditions 117
15974 88101 PM2.5 - Local Conditions 117
15975 88101 PM2.5 - Local Conditions 117
15976 88101 PM2.5 - Local Conditions 117
Method.Description CBSA.Code
15971 R & P Model 2000 PM2.5 Sampler w/WINS 40900
15972 R & P Model 2000 PM2.5 Sampler w/WINS 40900
15973 R & P Model 2000 PM2.5 Sampler w/WINS 40900
15974 R & P Model 2000 PM2.5 Sampler w/WINS 40900
15975 R & P Model 2000 PM2.5 Sampler w/WINS 40900
15976 R & P Model 2000 PM2.5 Sampler w/WINS 40900
CBSA.Name State.FIPS.Code State
15971 Sacramento--Roseville--Arden-Arcade, CA 6 California
15972 Sacramento--Roseville--Arden-Arcade, CA 6 California
15973 Sacramento--Roseville--Arden-Arcade, CA 6 California
15974 Sacramento--Roseville--Arden-Arcade, CA 6 California
15975 Sacramento--Roseville--Arden-Arcade, CA 6 California
15976 Sacramento--Roseville--Arden-Arcade, CA 6 California
County.FIPS.Code County Site.Latitude Site.Longitude
15971 113 Yolo 38.66121 -121.7327
15972 113 Yolo 38.66121 -121.7327
15973 113 Yolo 38.66121 -121.7327
15974 113 Yolo 38.66121 -121.7327
15975 113 Yolo 38.66121 -121.7327
15976 113 Yolo 38.66121 -121.7327
EPA 2022 data ranges from 01/01/2022- 12/31/2022 & values are stored under the variable “Date”. There seems to be a sudden spike in PM2.5 values on the 19th and 25th of December.
EPA 2002 data ranges from 01/05/2002- 12/31/2002 & values are stored under the variable “Date”. The air quality data in 2002, upon initial inspection, seems to indicate higher PM2.5 at the start of the year and lowered levels towards the end, and a similar spike is seen on the 25th of December.
1.5 Visualizing the distribution of the key variable- PM2.5 To visualize the distribution of PM2.5, we can plot a quick histogram:
# 2022 Data
hist(epa22$Daily.Mean.PM2.5.Concentration, breaks = 40,
main = "PM2.5 in 2022 CA", xlab = "PM2.5 ug/m3")hist(epa02$Daily.Mean.PM2.5.Concentration, breaks = 40,
main = "PM2.5 in 2002 CA", xlab = "PM2.5 ug/m3")It looks like the PM2.5 in 2022 lies strongly towards the 0-50 range, with a majority (visually) between 0-10 ug/m3. In 2002, a lot of observations fall between 0-40 ug/m3, but some values lie beyond 40 ug/m3.
The EDA steps 1-5 are now complete.
Combining 2002 & 2022 data, and changing the name of the PM2.5 variable.
Since the variables and their types match 1-1 between 2002 and 2022, it makes sense to combine them into the same data frame for convenience. And, since the PM2.5 variable is currently too long to reference each time (“Daily.Mean.PM2.5.Concentration”), it makes it easier to change the name to something more accessible. I am choosing “MeanPM”. I will be using Roger Peng’s EDA manual to write some of my code.
2.1 Combining both data frames
# We need dplyr for this action
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# Using the bind_rows function to merge both data frames
epa_all <- bind_rows(epa02, epa22)
# Checking if the binding was successful using head and tail
head(epa_all) Date Source Site.ID POC Daily.Mean.PM2.5.Concentration Units
1 01/05/2002 AQS 60010007 1 25.1 ug/m3 LC
2 01/06/2002 AQS 60010007 1 31.6 ug/m3 LC
3 01/08/2002 AQS 60010007 1 21.4 ug/m3 LC
4 01/11/2002 AQS 60010007 1 25.9 ug/m3 LC
5 01/14/2002 AQS 60010007 1 34.5 ug/m3 LC
6 01/17/2002 AQS 60010007 1 41.0 ug/m3 LC
Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1 81 Livermore 1 100
2 93 Livermore 1 100
3 74 Livermore 1 100
4 82 Livermore 1 100
5 98 Livermore 1 100
6 115 Livermore 1 100
AQS.Parameter.Code AQS.Parameter.Description Method.Code
1 88101 PM2.5 - Local Conditions 120
2 88101 PM2.5 - Local Conditions 120
3 88101 PM2.5 - Local Conditions 120
4 88101 PM2.5 - Local Conditions 120
5 88101 PM2.5 - Local Conditions 120
6 88101 PM2.5 - Local Conditions 120
Method.Description CBSA.Code
1 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
2 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
3 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
4 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
5 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
6 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS 41860
CBSA.Name State.FIPS.Code State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA 6 California 1
2 San Francisco-Oakland-Hayward, CA 6 California 1
3 San Francisco-Oakland-Hayward, CA 6 California 1
4 San Francisco-Oakland-Hayward, CA 6 California 1
5 San Francisco-Oakland-Hayward, CA 6 California 1
6 San Francisco-Oakland-Hayward, CA 6 California 1
County Site.Latitude Site.Longitude
1 Alameda 37.68753 -121.7842
2 Alameda 37.68753 -121.7842
3 Alameda 37.68753 -121.7842
4 Alameda 37.68753 -121.7842
5 Alameda 37.68753 -121.7842
6 Alameda 37.68753 -121.7842
tail(epa_all) Date Source Site.ID POC Daily.Mean.PM2.5.Concentration Units
75889 12/01/2022 AQS 61131003 1 3.4 ug/m3 LC
75890 12/07/2022 AQS 61131003 1 3.8 ug/m3 LC
75891 12/13/2022 AQS 61131003 1 6.0 ug/m3 LC
75892 12/19/2022 AQS 61131003 1 34.8 ug/m3 LC
75893 12/25/2022 AQS 61131003 1 23.2 ug/m3 LC
75894 12/31/2022 AQS 61131003 1 1.0 ug/m3 LC
Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
75889 19 Woodland-Gibson Road 1 100
75890 21 Woodland-Gibson Road 1 100
75891 33 Woodland-Gibson Road 1 100
75892 99 Woodland-Gibson Road 1 100
75893 77 Woodland-Gibson Road 1 100
75894 6 Woodland-Gibson Road 1 100
AQS.Parameter.Code AQS.Parameter.Description Method.Code
75889 88101 PM2.5 - Local Conditions 145
75890 88101 PM2.5 - Local Conditions 145
75891 88101 PM2.5 - Local Conditions 145
75892 88101 PM2.5 - Local Conditions 145
75893 88101 PM2.5 - Local Conditions 145
75894 88101 PM2.5 - Local Conditions 145
Method.Description CBSA.Code
75889 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
75890 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
75891 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
75892 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
75893 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
75894 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC 40900
CBSA.Name State.FIPS.Code State
75889 Sacramento--Roseville--Arden-Arcade, CA 6 California
75890 Sacramento--Roseville--Arden-Arcade, CA 6 California
75891 Sacramento--Roseville--Arden-Arcade, CA 6 California
75892 Sacramento--Roseville--Arden-Arcade, CA 6 California
75893 Sacramento--Roseville--Arden-Arcade, CA 6 California
75894 Sacramento--Roseville--Arden-Arcade, CA 6 California
County.FIPS.Code County Site.Latitude Site.Longitude
75889 113 Yolo 38.66121 -121.7327
75890 113 Yolo 38.66121 -121.7327
75891 113 Yolo 38.66121 -121.7327
75892 113 Yolo 38.66121 -121.7327
75893 113 Yolo 38.66121 -121.7327
75894 113 Yolo 38.66121 -121.7327
head(epa_all)- Data from 2002 visible tail(epa_all)- Data from 2022 visible Interpretation- The merge was successful.
2.2 Creating a column for “Year” from “Date”
# We need lubridate for this action
library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
epa_all$Year <- year(mdy(epa_all$Date)) # The reason is used mdy here is
# because Date is stored as a character value in the original data frame
# Checking to see if it worked
head(epa_all$Year)[1] 2002 2002 2002 2002 2002 2002
tail(epa_all$Year)[1] 2022 2022 2022 2022 2022 2022
2.3 Renaming the PM2.5 variable
# We need dplyr for this action
library(dplyr)
# Renaming the column
epa_all <- rename(epa_all, MeanPM = Daily.Mean.PM2.5.Concentration)
# Checking to see if it worked
head(epa_all$MeanPM)[1] 25.1 31.6 21.4 25.9 34.5 41.0
tail(epa_all$MeanPM)[1] 3.4 3.8 6.0 34.8 23.2 1.0
The MeanPM (previously Daily.Mean.PM2.5.Concentration) column now reflects the PM2.5 values.
Creating a basic map(s) of the locations of the monitoring sites, using different colors for each year. For this step, I will be using leaflet.
# Loading in the libraries I will need
library(dplyr)
library(leaflet)
sites <- epa_all %>% distinct(Year, Site.ID, Site.Latitude, Site.Longitude)
leaflet() %>%
addTiles() %>%
addCircleMarkers(data = filter(sites, Year == 2002), lng = ~Site.Longitude, lat = ~Site.Latitude, color = "blue", fillOpacity = 0, radius = 6) %>%
addCircleMarkers(data = filter(sites, Year == 2022), lng = ~Site.Longitude, lat = ~Site.Latitude, color = "red", fillOpacity = 0.9, radius = 5)Summary: The 2002 and 2022 data are well distributed throughout the state of California, but the 2022 data is much denser, and has more data points as compared to 2002. I would say the overall distribution is stable. Since I made the 2002 data points ring-shaped, I can easily visualize the overlapping sites on my map.
Finding missing or implausible values of PM2.5, calculate the proportions of such data per year and then finding the temporal distribution of missing/implausible data.
For this, we can 1. Find the number of missing/NA values per year. 2. Define a plausible range for PM2.5, and then find the values that lie outside of it. 3. Find the total number of PM2.5 values (including the NA and implausible entries) for the years 2002 and 2022, and then use a simple division to find the proportion.
# NA values per year: Using dplyr to summarise the NA values per year
library(dplyr)
epa_all %>% group_by(Year) %>% summarise(missing_PM = sum(is.na(MeanPM)),
total = n(),
propor_missing = missing_PM / total)# A tibble: 2 × 4
Year missing_PM total propor_missing
<dbl> <int> <int> <dbl>
1 2002 0 15976 0
2 2022 0 59918 0
There seems to be no NA values in either year, which is a good sign.
“Implausible” values of PM2.5 can vary based on the definition of implausibility. Any zero or negative PM2.5 value does not make sense logically, and thus, the lower range of implausibility can be set at <= 0. For the upper range of implausibility, EPA standards suggest that any PM2.5 values greater than 200 are highly unhealthy (hazardous above 300). My immediate thought, is that the State of California has historically poor air quality metrics, and also experiences regular wildfires which can add to the PM2.5 levels. So, in order to decide if an upper level is to be set for implausibility, I would like to look at the dates when PM2.5 > 200 were observed, and see if those dates correlate with any known high-pollution events during that year in California.
MeanPMImp22 <- epa_all[epa_all$MeanPM > 200 & epa_all$Year == 2022, ]
MeanPMImp02 <- epa_all[epa_all$MeanPM > 200 & epa_all$Year == 2002, ]
dim(MeanPMImp22)[1] 7 23
dim(MeanPMImp02)[1] 0 23
We see that there are no PM2.5 values > 200 in 2002, and 7 such values in 2022. Let’s try and see what dates were these 7 values were recorded.
library(dplyr)
library(ggplot2)
MeanPMImp22 <- MeanPMImp22 %>% select(MeanPM, Date, County)
View(MeanPMImp22)
epa_22 <- epa_all %>% filter(Year == 2022)
ggplot(epa_22, aes(x = as.Date(Date, format = "%m/%d/%Y"), y = MeanPM)) +
geom_line(color = "blue") +
geom_point(data = subset(epa_22, MeanPM > 200),
aes(x = as.Date(Date, format = "%m/%d/%Y"), y = MeanPM),
color = "red") +
labs(title = "Daily Mean PM2.5 in CA, 2022", x = "Date", y = "PM2.5 ug/m3")A quick glance at the Calfire website, searching for fire incidents in these counties in 2022 shows me that there have been wildfires/structural fires close to the recorded dates of PM values > 200, so I will retain these values, and only remove <= 0 PM2.5 values from epa_all.
library(dplyr)
epa_all <- filter(epa_all, MeanPM > 0)
summary(epa_all$MeanPM) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.10 4.50 7.60 10.08 12.20 302.50
We now see that the new minimum for epa_all is 0.10 and that all rows where MeanPM <= 0 have been removed.
Exploring the main question of interest (whether PM2.5 values have reduced between 2002 and 2022 in California) at three levels of spatial resolution
5.1: State
For the state-level visualization, it would be useful to see a boxplot distributed by year.
# Required libraries
library(dplyr)
library(ggplot2)
library(tidyr)
ggplot(epa_all, aes(factor(epa_all$Year), epa_all$MeanPM)) + geom_boxplot() +
labs(x = "Year", y = "PM2.5 (ug/m3)",
title = "California Daily PM2.5 by year")Warning: Use of `epa_all$Year` is discouraged.
ℹ Use `Year` instead.
Warning: Use of `epa_all$MeanPM` is discouraged.
ℹ Use `MeanPM` instead.
We could also visualise this as a violin plot
ggplot(epa_all, aes(factor(Year), MeanPM)) +
geom_violin(fill = "lightblue", color = "black") +
labs(x = "Year", y = "PM2.5 ug/m3",
title = "California Daily PM2.5 Distribution by Year")Now for some summary statistics:
state_level <- epa_all %>% group_by(Year) %>%
summarise(mean_PM = mean(MeanPM, na.rm = TRUE),
median_PM = median(MeanPM, na.rm = TRUE),
min_PM = min(MeanPM, na.rm = TRUE),
max_PM = max(MeanPM, na.rm = TRUE))
print(state_level)# A tibble: 2 × 5
Year mean_PM median_PM min_PM max_PM
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2002 16.1 12 0.1 104.
2 2022 8.47 6.9 0.1 302.
Interpretation: Both mean and median PM2.5 values decreased in 2022 compared to 2002, even though the maximum PM2.5 recorded in 2022 was much higher than 2002. The violin plot has a long tail, which represents the extreme values of PM2.5 recorded in 2022.
5.2: County
For the county-level visualization, it might be useful to look at a slopegraph, which is commonly used in biological data analyses to look at changes in gene expression levels (for small sets of genes) over time. (slopegraphs aren’t usually used for large amounts of data (like tens of thousands of genes), because the plot begins to look messy). The question here, is whether to plot median or mean PM2.5 values. Median usually discounts extreme outliers and represents the way data looks independent of small data outliers, while using mean can skew the plotted observation to either extreme, but has the advantage of capturing the variability in the data. I would like to plot both, and compare the two.
library(stringr)
county_data <- epa_all %>% group_by(County, Year) %>%
summarise(med = median(MeanPM), .groups="drop")
ggplot(county_data, aes(x = factor(Year), y = med, group = County)) +
geom_line() + geom_point() + labs(x = NULL, y = "Median PM2.5 ug/m3",
title = "County medians: 2002 to 2022") +
theme(axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text = element_text(size = 7))# To also make this plot a layout, with one plot for each county
ggplot(county_data, aes(x = factor(Year), y = med, group = County)) +
geom_line() + geom_point() + facet_wrap(~ str_wrap(County, width = 30)) +
labs(x = NULL, y = "Median PM2.5 ug/m3", title = "County medians by year") +
theme(axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text = element_text(size = 7))`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
# The same plots, but using Mean
county_data2 <- epa_all %>% group_by(County, Year) %>%
summarise(men = mean(MeanPM), .groups = "drop")
ggplot(county_data2, aes(x = factor(Year), y = men, group = County)) +
geom_line() + geom_point() + labs(x = NULL, y = "Mean PM2.5 ug/m3",
title = "County average: 2002 to 2022") +
theme(axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text = element_text(size = 7))ggplot(county_data2, aes(x = factor(Year), y = men, group = County)) +
geom_line() + geom_point() + facet_wrap(~ str_wrap(County, width = 30)) +
labs(x = NULL, y = "Mean PM2.5 ug/m3", title = "County means by year") +
theme(axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
strip.text = element_text(size = 7))`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
Summarizing the county-level data:
library(stringr)
county_level <- epa_all %>% group_by(County, Year) %>%
summarise(mean_PM = mean(MeanPM, na.rm = TRUE),
median_PM = median(MeanPM, na.rm = TRUE),
.groups = "drop")
most_change_CL <- county_level %>%
pivot_wider(names_from = Year, values_from = c(mean_PM, median_PM)) %>%
mutate(mean_change= mean_PM_2002 - mean_PM_2022,
median_change = median_PM_2002 - median_PM_2022)
most_change_mean <- most_change_CL %>% arrange(desc(mean_change)) %>%
select(County, mean_change)
most_change_median <- most_change_CL %>%
arrange(desc(median_change)) %>% select(County, median_change)
print(county_level)# A tibble: 98 × 4
County Year mean_PM median_PM
<chr> <dbl> <dbl> <dbl>
1 Alameda 2002 14.3 10
2 Alameda 2022 8.22 7
3 Butte 2002 14.8 11.5
4 Butte 2022 6.29 4.5
5 Calaveras 2002 9.9 8
6 Calaveras 2022 6.05 5
7 Colusa 2002 11.7 9
8 Colusa 2022 7.61 6.7
9 Contra Costa 2002 15.1 9.5
10 Contra Costa 2022 8.24 7.2
# ℹ 88 more rows
print(most_change_mean)# A tibble: 51 × 2
County mean_change
<chr> <dbl>
1 Riverside 13.2
2 Kern 13.0
3 Tulare 12.6
4 Merced 11.5
5 Kings 10.3
6 Fresno 9.72
7 San Joaquin 8.73
8 Los Angeles 8.67
9 Orange 8.57
10 Butte 8.47
# ℹ 41 more rows
print(most_change_median)# A tibble: 51 × 2
County median_change
<chr> <dbl>
1 Tulare 11.3
2 Riverside 9.7
3 Kern 9.5
4 Merced 9.1
5 Kings 7.4
6 Los Angeles 7.1
7 Butte 7
8 San Joaquin 6.6
9 Ventura 6.6
10 Orange 6.5
# ℹ 41 more rows
Interpretation: Visually, the slopegraphs show that in most counties in California, the median and mean PM2.5 reduced in 2022 compared to 2002. Using pivot_wider, we can find the county with the most change in median and mean PM2.5. Since in pivot_wider I subtract the mean/median PM of 2022 from 2002, the most positive values represent the most decrease in PM2.5 (mean/median) in 2022.
Riverside saw the best improvement in mean PM2.5, while Tulare saw the best improvement in median PM2.5. This is really interesting, as it seems to highlight the difference between using mean vs. median to analyze data.
5.3: City (Los Angeles County Sites)
For the city-level visualization, I see two different identifiers for Sites- “Local.Site.Name” and “Site.ID”. I want to see how the data is recorded for both variables and decide which one to use.
table(epa_all$Local.Site.Name)
118
29 Palms
767
3425 N FIRST ST, FRESNO
499
Agua Tibia
168
Alpine
467
Alturas-Fourth St.
2
Anaheim
1120
Atascadero
345
Atascadero (original site)
116
Auburn-Atwood
344
Azusa
415
Bakersfield-Airport (Planz)
231
Bakersfield-California
2007
Bakersfield-Golden / M St
420
Banning Airport
365
Big Bear
450
Bishop Tribe EMO
704
Bliss SP
113
Bliss SP (TRPA)
117
Brawley-220 Main Street
348
Brawley-401 Main St.
85
Burbank
122
Calexico-Ethel Street
613
Camp Pendleton
479
Campo Indian Reservation
724
Carmel Valley
343
CDF
359
Chester-222 First Ave
299
Chico-East Avenue
409
Chico-Manzanita Ave.
120
Chula Vista
218
Clovis-Villa
443
Colfax-City Hall
358
Colusa-Sunrise Blvd
454
Compton
723
Concord
739
Corcoran-Patterson
443
Cortina Indian Rancheria
42
Crescent City-Crescent Elk School
332
Crestline
178
Davis-UCD Campus
312
Death Valley NP - Park Village
115
Donovan
507
Echo Summit
2
El Cajon
416
El Cajon - Lexington Elementary School
631
El Centro-9th Street
450
El Rio-Rio Mesa School #2
469
Escondido
355
Eureka I Street
59
Folsom-Natoma St.
689
Fontana
298
Fremont - Chapel Way
105
Fresno - Garland
533
Fresno-Foundry
357
Fresno-Pacific
443
Gilroy
340
Glendora
361
Goleta
359
Grass Valley-Litton Building
404
Hanford-Irwin
361
Hollister
356
Hoover
228
Huron
357
Indio
240
Jacobs
116
Joshua Tree NP
215
Kaiser
195
Kearny Mesa
111
Keeler
635
King City 2
344
Lake Elsinore
360
Lake Tahoe Community College
111
Lakeport-Lakeport Blvd.
61
Lakeport-S. Main Street
61
Lancaster-Division Street
455
Laney College
360
Lassen Volcanic NP
209
Lava Beds National Monument
216
Lebec
490
Lebec-Peace Valley Road
93
Lebec-Peace Valley/Frazier Park Roads
106
Lee Vining
716
Lincoln-2885 Moore Road
357
Livermore
439
Lompoc H Street
351
Lone Pine Paiute-Shoshone Reservation
604
Long Beach (North)
411
Long Beach (South)
239
Long Beach-Route 710 Near Road
574
Los Angeles-North Main Street
1271
Lynwood
122
Madera-City
360
Mammoth Lakes
110
Manteca
308
Merced-Coffee
359
Merced-M St
449
Mesa2
357
Mira Loma (Van Buren)
777
Mission Viejo
227
Modesto-14th Street
587
Mojave
100
Mojave - CA 58 Business
355
Morongo Air Monitoring Station
370
North Hollywood (NOHO)
365
Oakland
365
Oakland West
359
Ojai - East Ojai Ave
360
Ontario Fire Station
111
Ontario-Route 60 Near Road
530
Pala Airpad
694
Palm Springs
239
Paradise - Theater
347
Pasadena
241
Pechanga
401
Pico Rivera #2
118
Pinnacles NP
238
Piru - Pacific
475
Pleasanton - Owens Ct
361
Point Reyes National Seashore
213
Porterville
356
Portola
485
Portola-161 Nevada Street
100
Quincy-N Church Street
410
Red Bluff-Walnut St. District Office
347
Redding - Buckeye
54
Redding - Health Department
427
Redding - Toyon
59
Redwood City
449
Redwood NP
229
Reseda
602
Ridgecrest-California Ave
104
Ridgecrest-Ward
355
Riverside (Magnolia)
115
Roseville-N Sunrise Ave
409
Rubidoux
1372
Sacramento Health Department-Stockton Blvd.
154
Sacramento-1309 T Street
826
Sacramento-Bercut Drive
359
Sacramento-Del Paso Manor
988
Salinas 3
530
San Andreas-Gold Strike Road
414
San Bernardino
235
San Diego - Kearny Villa Rd.
172
San Diego - Sherman Elementary School
664
San Diego -Rancho Carmel Drive
119
San Diego-12th Ave
352
San Francisco
546
San Gorgonio Wilderness
229
San Jose
217
San Jose - 4th St.
141
San Jose - Jackson
585
San Jose - Knox Avenue
358
San Lorenzo Valley Middle School
345
San Luis Obispo-Marsh St.
52
San Pablo
352
San Rafael
547
Santa Barbara
349
Santa Clarita
365
Santa Cruz
406
Santa Maria
146
Santa Rosa - 5th St
93
Sebastopol
345
Sequoia & Kings Canyon NPs - Ash Mountain
340
Sequoia NP
226
Signal Hill (LBSH)
289
Simi Valley-Cochran Street
873
SLO Roberto
358
Sloughhouse
342
South Lake Tahoe-Sandy Way
93
Stn.1 Big Pine Paiute site
650
Stockton - University Park
348
Stockton-Hazelton
124
Table Mountain Air Monitoring Site
339
Tahoe City-Fairway Drive
355
Temecula
365
Thousand Oaks
512
Torres Martinez Reservation
470
Tracy-Airport
345
TRAFFIC, RURAL PAVED ROAD
700
Tranquillity
349
Trinity
139
Truckee-Fire Station
545
Turlock
340
Ukiah-Library
469
Upland
365
Vallejo
816
Victorville-Park Avenue
938
Visalia-Church
404
Visalia-W. Ashland Avenue
393
Weaverville-Courthouse
350
Willits-Blosser Lane
351
Willows-Colusa Street
328
WMRC/NCORE
755
Woodland-Gibson Road
171
Yosemite NP
340
Yosemite NP-Yosemite Village Vistor Center
522
Yreka
312
Yuba City
826
table(epa_all$Site.ID)
60010007 60010009 60010011 60010012 60010015 60011001 60070002 60070008
439 365 359 360 361 105 120 409
60072002 60074001 60090001 60110007 60111002 60130002 60131004 60150002
347 700 414 42 454 739 352 229
60150007 60170011 60170012 60179000 60179001 60179002 60190008 60190011
332 93 2 113 111 117 499 533
60190500 60192008 60192009 60192016 60195001 60195025 60199000 60210003
339 357 349 357 443 443 195 328
60231002 60231004 60250003 60250005 60250007 60250010 60251003 60270002
59 116 85 613 348 470 450 755
60270101 60271003 60271018 60271023 60271033 60290010 60290011 60290014
115 635 604 704 650 420 100 2007
60290015 60290016 60290018 60290019 60292009 60299001 60299002 60310004
104 231 355 355 340 106 93 443
60311004 60333001 60333002 60370002 60370016 60371002 60371103 60371201
361 61 61 415 361 122 1271 602
60371301 60371302 60371601 60371602 60372005 60374002 60374004 60374008
122 723 118 118 241 411 239 574
60374009 60374010 60376012 60379033 60379034 60392010 60410001 60410002
289 365 365 455 150 360 360 213
60430003 60431001 60450006 60452003 60470003 60472510 60490001 60510001
340 522 469 351 359 449 2 110
60510005 60519000 60530002 60530008 60531003 60570005 60571001 60590007
716 228 343 344 530 404 545 1120
60592022 60610003 60610004 60610006 60611004 60612003 60631006 60631007
227 344 358 409 355 357 410 299
60631009 60631010 60650009 60650012 60650016 60650500 60651003 60651016
100 485 401 365 365 767 115 370
60652002 60655001 60658001 60658005 60659000 60659001 60670006 60670010
240 239 1372 777 168 360 988 826
60670012 60670015 60674001 60675003 60690002 60690003 60710005 60710025
689 359 154 342 356 238 178 111
60710027 60710306 60711004 60712002 60718001 60719002 60719004 60719010
530 938 365 298 450 215 235 229
60730001 60730003 60730006 60730077 60731002 60731006 60731007 60731008
218 416 111 724 355 467 352 479
60731014 60731016 60731017 60731022 60731026 60731201 60750005 60771002
507 172 119 631 664 694 546 124
60771003 60772010 60773005 60792002 60792004 60792007 60792020 60798001
348 308 345 52 357 359 358 116
60798002 60811001 60830011 60831008 60831009 60832004 60832011 60839000
345 449 349 60 86 351 359 187
60850002 60850004 60850005 60850006 60852003 60870007 60871005 60890004
340 141 585 358 217 406 345 427
60893003 60893004 60893005 60930005 60932001 60950004 60970003 60970004
209 54 59 216 312 816 93 345
60990005 60990006 61010003 61030007 61050002 61059000 61070009 61071001
587 340 826 347 350 139 340 226
61072002 61072003 61072010 61110007 61110009 61111004 61112002 61113001
404 393 356 512 475 360 873 469
61130004 61131003
312 171
It seems like the local site names are very variable. Some names are capitalised while others are not, which might make it difficult for data processing. Instead, using the Site ID can ensure consistency.
library(stringr)
la_sites <- epa_all %>% filter(County == "Los Angeles") %>%
group_by(Site.ID, Year, Local.Site.Name) %>%
summarise(med = median(MeanPM, .groups="drop"))`summarise()` has grouped output by 'Site.ID', 'Year'. You can override using
the `.groups` argument.
ggplot(la_sites, aes(x = factor(Year), y = med, group = Site.ID)) + geom_line() +
geom_point() + facet_wrap(~ str_wrap(Local.Site.Name, width = 15),
scales = "free_y") +
labs(x = NULL, y = "Median PM2.5 ug/m3",
title = "LA County site medians 2002 vs 2022")`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
# Repeating the same plots but with mean PM2.5 values.
la_sites_mean <- epa_all %>% filter(County == "Los Angeles") %>% group_by(Site.ID, Year, Local.Site.Name) %>%
summarise(men = mean(MeanPM, .groups="drop"))`summarise()` has grouped output by 'Site.ID', 'Year'. You can override using
the `.groups` argument.
ggplot(la_sites_mean, aes(x = factor(Year), y = men, group = Site.ID)) +
geom_line() + geom_point() +
facet_wrap(~ str_wrap(Local.Site.Name, width = 15), scales = "free_y") +
labs(x = NULL, y = "Mean PM2.5 ug/m3",
title = "LA County site means 2002 vs 2022")`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
Summarizing the site level data:
site_level <- epa_all %>% filter(County == "Los Angeles") %>%
group_by(Local.Site.Name, Year) %>%
summarise(mean_PM = mean(MeanPM, na.rm = TRUE),
median_PM = median(MeanPM, na.rm = TRUE),
.groups = "drop")
most_change_SL <- site_level %>%
pivot_wider(names_from = Year, values_from = c(mean_PM, median_PM)) %>%
mutate(mean_change_SL = mean_PM_2002 - mean_PM_2022,
median_change_SL = median_PM_2002 - median_PM_2022)
most_change_mean_SL <- most_change_SL %>% arrange(desc(mean_change_SL)) %>%
select(Local.Site.Name, mean_change_SL)
most_change_median_SL <- most_change_SL %>% arrange(desc(median_change_SL)) %>%
select(Local.Site.Name, median_change_SL)
print(site_level)# A tibble: 25 × 4
Local.Site.Name Year mean_PM median_PM
<chr> <dbl> <dbl> <dbl>
1 "" 2002 23.9 21.4
2 "Azusa" 2002 20.8 18.7
3 "Azusa" 2022 9.72 9.65
4 "Burbank" 2002 24.0 21.6
5 "Compton" 2022 13.0 11.9
6 "Glendora" 2022 8.52 7.9
7 "Lancaster-Division Street" 2002 10.4 10
8 "Lancaster-Division Street" 2022 7.52 7.3
9 "Lebec" 2002 4.82 4.8
10 "Lebec" 2022 3.50 3.4
# ℹ 15 more rows
print(most_change_mean_SL)# A tibble: 18 × 2
Local.Site.Name mean_change_SL
<chr> <dbl>
1 "Pasadena" 11.2
2 "Azusa" 11.0
3 "Los Angeles-North Main Street" 10.4
4 "Long Beach (North)" 9.55
5 "Reseda" 8.14
6 "Lancaster-Division Street" 2.86
7 "Lebec" 1.32
8 "" NA
9 "Burbank" NA
10 "Compton" NA
11 "Glendora" NA
12 "Long Beach (South)" NA
13 "Long Beach-Route 710 Near Road" NA
14 "Lynwood" NA
15 "North Hollywood (NOHO)" NA
16 "Pico Rivera #2" NA
17 "Santa Clarita" NA
18 "Signal Hill (LBSH)" NA
print(most_change_median_SL)# A tibble: 18 × 2
Local.Site.Name median_change_SL
<chr> <dbl>
1 "Pasadena" 9.9
2 "Azusa" 9.05
3 "Los Angeles-North Main Street" 8.4
4 "Long Beach (North)" 7.25
5 "Reseda" 6.75
6 "Lancaster-Division Street" 2.7
7 "Lebec" 1.4
8 "" NA
9 "Burbank" NA
10 "Compton" NA
11 "Glendora" NA
12 "Long Beach (South)" NA
13 "Long Beach-Route 710 Near Road" NA
14 "Lynwood" NA
15 "North Hollywood (NOHO)" NA
16 "Pico Rivera #2" NA
17 "Santa Clarita" NA
18 "Signal Hill (LBSH)" NA
Interpretation: It seems like although there were no NA values for PM2.5, there are some PM2.5 values recorded without a Local.Site.Name attribute, and some Local.Site.Name attributes for whom PM2.5 was not recorded in one or the other year. Regardless, comparing the line plots for 2002 to 2022 for all sites in LA county, median and mean PM2.5 decreased in 2022. The summary statistics using pivot_wider also show us that Pasadena saw the most improvement in both mean and median PM2.5 in LA county.
From the exploratory data analysis, across CA, daily PM2.5 concentrations decreased from 2002 to 2022. Most counties and LA County sites show declines, with the largest improvements being in Riverside/Tulare. While 2022 had a few extremely polluted days (PM2.5 > 200 ug/m3), these are one-off incidents and not persistent. Overall, the evidence supports that PM2.5 levels decreased between 2002 and 2022 in California.